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Abstract 

The use of last generation Programmable Electronic Components makes possible the 
construction of very powerful and competitive special purpose computers. We have 
designed, constructed and tested a three-dimensional Spin Glass model dedicated 
machine, which consists of 12 identical boards. Each single board can simulate 8 
different systems, updating all the systems at every clock cycle. The update speed 
of the whole machine is 217ps/spin with 48 MHz clock frequency. A device devoted 
to fast random number generation has been developed and included in every board. 
The on-board reprogrammability permits us to change easily the lattice size, or even 
the update algorithm or the action. We present here a detailed description of the 
machine and the first runs using the Heat Bath algorithm. 
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1 Introduction 



Two approaches have become popular in the field of computer design for sci- 
entific calculations: special or general purpose computers. Lattice Monte Carlo 
in Quantum Field Theory and Statistical Mechanics requires large computa- 
tional power in relatively general purpose computers and the processing can 
often be parallelized. Various groups have developed their own parallel ma- 
chines for those simulations [1] [2] [3] [4]. Those general purpose computers 
require continuous technological upgrading and investment to obtain compet- 
itive results. On the other hand, special purpose computers can approach very 
specific problems, achieving better performance than general computers. 

The emergence in the market of Complex Programmable Logic Devices (CPLD) 
makes it possible to design dedicated machines with low cost and high per- 
formance. In this paper we describe a CPLD-based machine, dedicated to 
three-dimensional spin glass models with variables belonging to Z 2 and cou- 
plings to first neighbours, and report on the reliability tests which have been 
carried out. 

Our machine is called SUE, for Spin Updating Engine, because its task is to 
generate sets of updated spin configurations in the Monte Carlo simulation. 
In a previous work [5], we presented the prototype for the two-dimensional 
model, and introduced the first ideas about the final version. After checking 
that the 2d version worked properly, we have designed, constructed and tested 
the 3d version, which differs in some aspects from the 2d version as we will 
see below. The performance of the 3d machine is improved due to the fact 
that it can run more than a single model: the lattice size or the action of the 
physical model can be easily changed using the on-board reprogrammability of 
the CPLDs. A device devoted to generate a 32-bit random number has been 
developed and included in every SUE board. This device (described below) 
enables SUE to operate with both canonical and microcanonical algorithms. 

At present, spin glass models [6] are a progressing area of Statistical Mechan- 
ics. They are related to neural networks, spin models, some High T c supercon- 
ductivity models, etc. There is large activity in the 3d models because of the 
uncertainty in the vacuum structure at low temperature. Monte Carlo sim- 
ulations of spin glass systems have been used to study the phase transition, 
the ultrametric structure and the dynamics out of equilibrium [7]. Only sizes 
up to L = 16 have been simulated [8] [9] [10], due to the slow dynamics of the 
systems and the strong slowing-down as the size grows. Yet, those simulations 
requiring very simple calculations, they are easily implementable in a dedi- 
cated machine. That way the computational power needed to obtain results 
in larger lattices is obtained. 
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A standard way of studying spin glasses is the use of independent lattices with 
the same quenched couplings, called replicas. The overlap between two replicas 
acts as the order parameter in that model. A great improvement on the usual 
Monte Carlo scheme is the parallel tempering method [7]. The basic idea is 
to move in temperature space: the system changes its temperature, goes up 
to the paramagnetic phase and eventually goes back to lower temperature. 
With high probability in its motion through temperature the system will visit 
new local minima. That scheme has been implemented in SUE: Replicas at 
different temperatures are simulated, and systems running at adjacent levels 
can be swapped according to an appropriate probability distribution. 

An essential tool for the analysis of results is Finite Size Scaling [11], which 
requires the use of different volumes. In that sense, SUE is capable of working 
with different lattice sizes by reprogramming its CPLD devices. 

The main differences with respect to the 2d prototype presented in [5] are: 

• Larger and faster devices. 

• Multi-Layer instead of Double-Layer Printed Circuits. 

• On-board reprogrammability. 

• Dedicated device for 32-bit random number generation. 

• Demon and Heat Bath algorithm support. 

• Parallel Tempering implementation. 

• Driver and Software development for easy (transparent) use. 

At present we have built 12 boards and tested them using the Demon and 
Heat Bath algorithms in different lattice sizes. Each board simulates 8 lattices, 
updating 8 spins every 20.8ns cycle. The update speed of a single board is 
therefore 2.6ns /spin. The cost of each board is 2400 Euros (500 for PCB and 
mounting, and 1900 for components.) 

The summary of this paper is as follows: We start introducing the physical 
model in the next section. In section 3 We describe the electronic architecture 
of SUE, design considerations and software support. The development process 
is outlined in section 4. Last section is devoted to discuss the performance. 



2 The Physical Model 



We want to simulate the 3D Edwards-Anderson model with first neighbour 
couplings (see [6] for a detailed description of the model). The action of this 
model is given by 

E = ^a i a j J ij) (1) 
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where the value of the Ising spins a can be 1 or —1, and the couplings Jy are 
random variables taking the values ±1 with equal probability. For a fixed set 
of couplings { Jij} the partition function is 

Z(P,{Jij}) = EapW},M). (2) 
M 



We study the existence of phase transitions using as order parameter the over- 
lap between two independent systems (replicas) with the same set of couplings 
Jij. We should finally average over different realizations of the disorder (J^) 
to obtain physical results about the system. 

To calculate (2) we must sum over 2 V possible configurations, where V is 
the volume of the lattice, which is a very large number for any computer. 
The standard way to compute the partition function is to run an algorithm 
that selects only a representative set of configurations. There are different 
appropiate algorithms, see for instance the chapter by Sokal in [12]. For pure 
spin systems (all equal to 1) some cluster algorithms are very efficient, but 
for a general spin glass model only local algorithms achieve good efficiency 

Typically, we must run the algorithm and generate millions of different rep- 
resentative configurations in order to obtain accurate results. The autocorre- 
lation time t is a measure of the correlation between configurations: a run of 
length n provides only ~ n/r effectively independent samples. Near the criti- 
cal point, r diverges as r ~ L 2 , where L is the size of the system and z, the 
dynamical critical exponent, has been found to be around 6 (while in the pure 
Ising model it is close to 2). This strong slowing-down, due to the existence 
of many pure and metastable states, and the absence of non local algorithms, 
makes this problem really hard from a computational point of view. 

Two different updating algorithms have been implemented in our design, one 
microcanonical (Demon) and one canonical (Heat Bath). 

The Demon algorithm [13] [14] keeps the sum of the lattice energy and a demon 
energy constant. In order to generate the representative set of samples, we start 
from a spin configuration with an action S and a demon energy equal to zero. 
Now we use the algorithm to change the spins to generate new configurations 
(one for every V updates). The update of a spin is as follows: if the flip lowers 
the spin energy, the demon takes that energy and the flip is accepted. On 
the other hand, if the flip increases the spin energy, the change is only made 
if the demon energy is sufficient to transfer that energy to the system. The 
conservation of the total energy (lattice plus demon), has been useful in the 
programming/test stage, allowing fast tests of proper function. 
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In the Heat Bath algorithm [12], the new spin value for each site Oi is inde- 
pendent from the old one, and its probability distribution is that of a single 
Ising spin <7j in the effective magnetic field produced by the fixed neighbouring 
spins of 

P(rr I irr I \- ^ ( j(T > J 'i ' j (o\ 



The drawback of this canonical algorithm is the necessity of a random number 
to decide the acceptance of the new spin. 

The algorithms being local, to update a spin only the nearest neighbours 
are needed. Because of simplicity in the electronic design, we use helicoidal 
boundary conditions. Let us consider a lattice of side L and volume V = L 3 
with sites labelled in the standard way: the site [x, y, z] (with 0<x, y, z<L — 1) 
gets the index n = x + yxL + zxL 2 . We will call x + (y + , z + ) the neighbour 
in the positive direction along the x (y, z) axis. With our helicoidal boundary 
conditions the neighbours of the site n are simply: 

x + = (n + 1) mod V 

y+ = (n + L) mod V (4) 
z + — (n + L 2 ) mod V 



We define in an analogous manner the remaining neighbours x_, y_ and Z-. 




■ B I □ 



Fig. 1. Schematic view of the full d=3 machine. 
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3 Operation and General Structure of SUE 



The SUE machine is connected to a Host Computer (HC) running under 
Linux. SUE performs the update of the configurations, but the measurements 
and analysis are made by the HC. SUE is set up with initial spin configura- 
tions, couplings and several simulation parameters. Then SUE is started and 
simulation begins. After a certain number of iterations, SUE is stopped to 
download the configuration to the HC and SUE keeps the updating process. 
In this sense, SUE and the HC work in parallel: while SUE is updating the 
system the HC processes the previously read configurations. 

Fig. 1 shows a simple diagram of the whole machine which consists of the HC 
and n SUE boards (the figure is for n — 8, but the final system consists of 
12 processing modules). They are connected to the HC through a PCI Data 
Acquisition Card. Every processing module contains the hardware to store and 
update eight lattices in parallel. Note that there are two degrees of parallelism: 
inside the processing module and between the modules. 

Every clock cycle, the random number generator device included in each board 
provides a pseudo-random number which is shared for the update of the eight 
lattices, so the replicas (systems with the same couplings J^) must be simu- 
lated in different boards. We can then think of each pair of boards as a unit, 
allowing us to simulate eight pairs of replicas (corresponding to eight realiza- 
tions of disorder J^). Periodically, the configurations are read and the relevant 
measurements carried out and stored. 

Parallel tempering requires the simulation of pairs of replicas with the same 
couplings at different values of (3. With 12 boards we could then simulate repli- 
cas corresponding to eight sets of couplings at 6 values of f3 at once. Parallel 
tempering requires more temperature levels, so each time the configuration is 
read the (3 is changed (the corresponding probability table is loaded), and the 
configuration to be updated is loaded onto the board (in the meantime it was 
stored in the HC). Different temperatures are then sequentially simulated. 

The HC controls this mechanism, and is responsible for deciding whether 
the configurations being simulated at adjacent temperatures are interchanged. 
Given the configurations X at temperature (3 and X' at temperature (3\ we 
compute 

A = {(3' - (3){E{X) - E{X')) (5) 

and use a Metropolis like test: if A < we accept the change, otherwise we 
swap the configurations with probability exp (—A). 
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By processing 8 spins in parallel on 12 modules (96 spins in total) within 
one clock cycle (clock period of 48 MHz), we obtain an update speed of 217 
ps/spin. The time spent reading, writing and processing (Meassurement and 
Paralell Tempering) the configurations is around 4% of the computation time 
in the smallest simulable lattice (L = 20), and decreases steeply with size (it 
is less than 1% of the total time for L = 30). 

Let us describe the main characteristics of a SUE board. Devices used are 
listed in table 1, apart from passive components (resistors, diodes, capacitors, 
leds, etc.). The main electronic devices are the Altera 10K CPLDs [15]. 

The photograph of one of the boards can be seen in fig. 2. It contains four 
devices FLEX 10K30 responsible for the core of the Monte Carlo simulation 
( UPDATE area on the figure). That four devices have the same electronic logic 
inside, which is prepared to update two lattices in parallel. 

On the right of these chips are the static memory devices (SRAM) which store 
the couplings of the lattices (J-MEMORY). On the left, each UPDATE device 
has two SRAM devices which store the spin variables (SPIN MEMORY). 
Latch devices are used as tristate devices to manage the polarity of the data 
buses at high frequency. 

The RNG device is a FLEX 10K50 where a random number generator is pro- 
grammed, allowing the use of canonic simulations. Addressing of the memories 
and sincronization between the devices are the main tasks of the fifth FLEX 
10K30 (ADDRESS). The coupling memories are addressed through latch de- 
vices to avoid fan-out problems. 

External communication is provided by three EPM7032 chips placed near the 
68-pin connector. One of them controls the board when the on-board pro- 



Qty 


Type 


Component 


Manufacturer 


5 


CPLD 


FLEX 10K30 


ALTERA 


1 


CPLD 


FLEX 10K50 


ALTERA 


2 


PLD 


EPM 7032-10 


ALTERA 


1 


PLD 


EPM 7032-7 


ALTERA 


26 


SRAM 


CY7C1031 


CYPRESS 


11 


LATCH 


CY162841 


CYPRESS 


6 


PLL 


CY2308-4 


CYPRESS 


1 


OSCILLATOR 


SG615P 


SEIKO EPSON 



Table 1 

Active Components in SUE 
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Fig. 2. SUE board 



grammable devices are not yet programmed. It responds to basic commands 
sent from the HC, allowing to select and program the board. The programmed 
logic establishes 4 control lines in each direction allowing communication be- 
tween the HC an the ADDRESS device, and a 32-bit data bus common to the 
HC and the UPDATE and RNG devices. The two lower bits in this bus reach 
ADDRESS too, and act as extra control lines when needed. 

The clock signal is distributed to all the synchronous devices in the board 
through Cypress 2308-4 PLL devices. In the upper right corner, a set of leds 
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permits us to visualize the state of the machine. The connection to the HC is 
made through a 68 pin (SCSI-2 type) connector. The SUE boards can share 
the same bus for an easy management from the HC. 

Once the general architecture of a SUE board has been outlined, the internal 
details are explained more deeply in the next subsections. 

3.1 Updating Logic 

Four Altera 10K30 devices are responsible for the update. To each one of those 
devices lines are assigned to access the spin and coupling memories and the 
32-bit bus through which the random number is provided. That bus is used 
also to write and read the memories, the demon energy or the probability 
table from the HC. 

In order to obtain an updated spin every clock cycle we have designed a 
pipeline structure that performs the algorithm step by step: A state machine 
runs over a 10 states cycle during the simulation, one spin being put into the 
updating pipeline at each step. 

We have already mentioned that both algorithms are local. Indeed, the devices 
that actually perform the updating ignore where in the lattice is the site being 
updated, or the size of the system. They just process their input data and 
output the updated spins. There is another device (ADDRESS) which takes 
care of the geometry and addresses the memories accordingly. That component 
is also responsible for stopping the simulation when the desired number of 
configurations has been calculated. 

3.2 Memory Scheme 

Two different memory banks, for spins and couplings, are available to each 
UPDATE device (SPIN MEMORY and J -MEMORY areas in fig. 2). 

Couplings are not dynamic variables (their values remain constant during 
the simulation), so the coupling memories are always in reading mode during 
update. When a site is to be updated, the couplings with its neigbours x+, X-, 
y + , y-, z+, Z- must be supplied to the updating engine. We organize therefore 
the 18 SRAM devices as a single bank of width 3x16 bits and depth 6 x 6AK. 
Each lattice takes 6 of the 48 bits to store the six needed couplings. The 
maximum simulable volume is then limited by the depth to L=73. 

The spins changing during the simulation, an appropiate mechanism is needed 
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in order to read and write the configurations simultaneously: The spin memory 
is duplicated (P and Q banks), and while one memory bank is read from the 
other is written on. Because of that, two memory devices are connected to 
every UPDATE component, each one capable to store QAKx 18 bits. 

In order to understand how the spin memory is managed, let us consider each 
column along the x axis of the lattice divided in blocks of fixed length /. 

To update one of the blocks, the block itself and its four y and z neighbours 
have to be supplied. So, five blocks must be read to update one, implying that, 
if an updated spin every clock cycle is wanted, the block length has to be at 
least five (see subsection 3.3 below). 

Each spin memory device of 18 bit words stores two lattices, so 9 bits are 
available for each lattice. Each block can contain from 5 to 9 spins, and the 
maximum lattice size L ~ (I x 64i^) 1//3 that can be stored is 68, 73 , 77, 80 or 
84, depending on the selected block size. This limitation, together with the one 
we found from the coupling memory and the fact that the number of blocks 
must be even, yields the range of simulable sizes shown in table 2. 



1=5 1=6 1=7 1=8 1=9 
20 24 28 
30 32 

36 36 
40 42 

48 48 
50 54 
56 

60 60 64 

70 72 

72 

Table 2 

Simulable lattice sizes 

The spin memories are arranged in the following way: Each 9-bit word con- 
tains / consecutive spins (5</<9), being consecutive (along x axis) lattice 
blocks stored in consecutive memory addresses. The V/l words are not read 
consecutively, but following a pattern that makes the block to be updated and 
its neighbouring blocks available to the UPDATE component, as explained in 
the next subsection. 

On the other hand, the coupling memories store in the n th 6-bit word the 
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couplings of the n spin with its six neighbours, and is read sequentially as 
the V spins are updated. 



SPIN BLOCK REGISTERS 
A B C D E 



SPIN MEMORY 
DATA PORT 1 



SPIN MEMORY 
DATA PORT 2 



OUTPUT 
PORT 



J-MEMORY 
DATA PORT 



SPIN & 
NEIGHBOURS 




COUPLINGS 



ENERGY 
BALANCE 



h- 1 , ^NEW 
SPIN SPIN 



UPDATED 
SPIN BLOCK 



DEMON 
ENERGY 



ADDER 



Fig. 3. Demon algorithm pipeline implemented in the UPDATE devices 



3.3 Pipelined Updating 



In this subsection we describe the logic programmed in every UPDATE device. 
We consider the case in which the demon algorithm is used with a block length 
I = 5 (see fig. 3). In this case, the algorithm runs over a state machine with 
ten states. In each of those states a block is read from bank Q, which is in 
reading mode, and stored in one of the internal registers A... J. Let us suppose 
we have already been in states 0...4, and some registers are already loaded: A 
(z- neighbouring block), B (y_ neighbouring block), C (block to be updated), 
D (y + neighbouring block) and E (z + neighbouring block). 

In state 5, we send for update the first spin in the block stored in C. The x + 
neighbour is in the same block, the x_ neighbour is the previous updated spin, 
which is still in the updating process (and will not be needed until the last 
step), and the rest of the neighbours are stored in blocks A,B,D,E. We read 
simultaneously the z_ neighbouring block of the next block to be updated and 
store it in register F. 

In states 6 to 8, we continue sending for update the spins second to fourth in 
block C, and loading registers G (next block y_ neighbour), H (next block to 
be updated) and I (next block y + neighbour). In state 9, the last spin in block 
C is sent into the update pipeline. It is no longer true that the x + neighbour 
is in the same block, but it is in the block we have already stored in register 
H. Register J (next block z + neighbour) is loaded. When we return to state 
0, the updating of the block registered in H starts. 
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After some cycles, the updated value of the block that was stored in C has 
been calculated and is written on the appropiate memory position in bank P, 
in writing mode. Writing follows the same scheme as reading, not only the 
updated values are written but also the unchanged neighbours. 

When a whole column (containing L/l blocks of I spins each) has been up- 
dated, we change the role of the memories: we will now write on bank Q, the 
memory bank we were previously reading from, and read from bank P, the 
bank we were writing on. Bank P stores now the correct new configuration. 
Bank Q stores the old configuration, which shall be overwritten with the result 
of updating the column read from bank P. 

We have seen that the writing and reading sequences are equal, although 
writing is obviously delayed with respect to reading several cycles. Due to this 
delay, to avoid problems in the change of role of the spin memory banks the 
first block of the updated column, which should be read while its bank is still 
in writing mode, is stored in a cache memory inside the UPDATE devices. 
This mechanism requires at least four blocks, making the minimum simulable 
size to be L = 20. 

3.4 Addressing Logic 

The ADDRESS device in fig. 2 controls and addresses the memories, estab- 
lishes the functioning mode of the UPDATE and RNG devices and takes charge 
of the communications with the HC. As we said above, The board is accessed 
through a communication port with 32 bidirectional lines devoted to data 
transfer and 8 control lines (4 in each direction). Data lines are connected 
through tri-state circuits to the bus connecting the UPDATE and RNG de- 
vices. Control lines are connected to the ADDRESS chip, which controls the 
board according to the commands sent from the HC. 

The implemented instruction set allows us to: 

• program the devices. 

• read/write the spin and coupling configurations. 

• read/write the demon energies. 

• load the number generator initialization table. 

• load the probability tables used in the Heat Bath algorithm. 

• set the number of iterations to run. 

• start the simulation. 

The ADDRESS device controls the UPDATE and RNG devices to carry out 
that operations. A 3-bit wide bus is used to encode the instructions for the 
UPDATE devices. 
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3.5 Random Number Generator 



The Altera 10K50 device (RNG chip in fig. 2) is a 32-bit pseudo-random 
number generator of the R250 kind. Those generators are known to suffer some 
problems in Monte Carlo simulations, but only with non-local algorithms [16]. 

In the C implementation, a vector is initialized with a conventional pseudo- 
random number generator. Using the macro instruction RANDOM we run 
over the wheel, getting a new number and changing one of the values in the 
wheel: 

#define RANDOM ( (irr[ip++] — irr [ipl++] +irr [ip2++] ) irr[ip3++] ) 

The variables involved need to be properly initialized before using the defined 
macro: 

/* random number generator initialization */ 

unsigned int irr [256] ; 

unsigned char ip, ipl, ip2, ip3; 

ip=128; 
ipl=ip-24; 
ip2=ip-55; 
ip3=ip-61 ; 

for(i=0; i<256; i++) 

irr [i] = (unsigned int) rand() ; 

In the RNG device (see Fig. 4), the irr[i] wheel becomes a 32-bit wide shift 
register, reproducing that way the effect of incrementing ip,ipl,ip2 and ip3. 
An adder sums the words WordA and WordB and stores the result in the first 
position IN_SR. This result also serves as input to a XOR function, together 
with the value in the last register WordC. The result of this function provides 
us with the pseudo-random number, every clock cycle. 

The seeds loading process is controlled by the ADDRESS component, which 
also enables the random number generation during the simulation. 

3. 6 Software 

The boards are connected to the HC through a data acquisition card PCI- 
DI032HS from National Instruments. To access the DAQ, a Linux driver has 
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Fig. 4. Random Number Generator 

been programmed, and also a user library allowing to operate with the boards 
in an easy way. 

The functions available to the user are the following: 

• dioinit : initializes the DAQ boards to be used by the HC. 

• boardsel : selects one board among those connected to the HC. 

• ws : writes the spin configuration corresponding to one of the UPDATE 
devices in the selected board. 

• rs : reads the spin configuration. 

• wj : writes the couplings of the lattices in the selected board. 

• r j : reads the couplings in the selected board. 

• rd : reads the Demon energy. 

• wd : writes the Demon energy. 

• wmesf r : sets the number of iterations in each run. 

• wrng : writes the initial random number table. 

• wprob : writes the probability table 2 on UPDATE. 

• st art sue : starts the simulation in the selected board. 

• wait sue : waits for one of the boards to finish. 



The functions to access the memories get the arguments as arrays of bytes, 
where each element corresponds to one site in the lattice and the eight bits in 
the element to each one of the lattices in the board, as is usual in multi-spin 
code. The user needs not worry about SUE internal details. 



3.7 Design considerations 
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3. 7. 1 Programming method 



CPLDs are electronic devices which can be programmed as many times as 
needed. They lose their program code every time the board is switched off, so 
they have to be reprogrammed after switch on. To manage the programming 
task, these devices are connected sequentially creating a JTAG chain which 
is controlled by the HC through the communication port described above. No 
extra cables are needed, providing easy on-board reprogrammability controlled 
from the HC. This feature was extremelly important during the debug process 
of the boards 



3. 7. 2 Printed Circuit Board 

The printed circuit board surface is 24.5 x 30.5cm 2 , and it is 2mm thick. 
Manufactured in FR4 fiber, it consists of eight layers (four dedicated to signal 
transimision and the others to powering). 

The board satisfies the ATX standard. In the final version, the 12 boards are 
mounted in a rack and are fed by a 800 Wat source at 5 V. Current, voltage 
and temperature are monitored. Full operation values are 90 Amp at 5 V. 



3. 7. 3 Frequency 

The proper working of the circuit requires perfect synchronization between 
the active devices. The working frequency is 48 MHz, and the clock signal 
should reach the 32 components spread over a 747cm 2 surface. 

The clock distribution is made through CY2308-4 devices (3.3V Zero Delay 
Buffers), provided with a PLL mechanism {Phase Locked Loop) that allows to 
double the input frequency and supply eight outputs. 

A 12 MHz oscillator is connected to a PLL device that doubles its frequency. 
Five outputs are driven into PLL components that double the frequency again 
and feed the neighbouring components. In this way, the clock is distributed 
across the circuit at low frequency, and the frequency doubled near the final 
components. 



3.7.4 Transmission lines 

As a consequence of the large size of the circuit, there exist connections with 
a large total trace length. The rise times of the signals determine whether the 
transmission line behaves like a distributed circuit or not. 
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The effective length associated with the rise time of a signal is 

T 



(6) 



where T r is the rise time and D the propagation delay, characteristic of the ma- 
terial. We must consider a distributed circuit if the length of the transmission 
line is greater than a quarter of the effective length. 

In our board, diode barriers protect the traces addressing the coupling mem- 
ories, the data bus connecting UPDATE and RNG devices, and the connector 
for external communication. The rest of the signals, generated by memory de- 
vices or Altera 10K components (the latter allow the user to set the rise time), 
have rise times short enough for the system to behave in a lumped fashion. 



4 Development Process 

Initially, only one board was manufactured. In a first stage, we tested its gen- 
eral performance. After being able to communicate with the machine, the pro- 
gramming mechanism was implemented. Different test programs were written 
and compiled to program the CPLDs, using Altera's MAXPlus+II develop- 
ment enviroment. Once we checked that all the components worked properly 
(fixing some electrical bugs in the way), the Demon algorithm was progres- 
sively implemented. We chose the demon algorithm because it is microcanon- 
ical, so the random number generator is not needed and the conservation of 
the total energy provides a fast test mechanism. Additional functionalities 
were added and the algortithm scheme was fine-tuned, until the program was 
complete. 

When the rest of the boards were available, they were tested with this Demon 
program, and the Heat Bath algorithm was then implemented. The structure 
of the algorithm remained almost the same, although some details in the al- 
gorithm had to be changed and some new functions were added to the user 
library. The main novelty was the random number generator usage, which 
worked finally with an appropriate pipelining scheme both in RNG and UP- 
DATE chips. 

In the early debugging stage, development had been carried at 24 MHz, so wc 
switched to high frequency. Some fine-tuning in the programs was needed and 
the CPLD logic layout was carefully studied in order to reach the design-goal 
of 48 MHz. 

To make sure of the proper working of the machine beyond any doubt, an 
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emulator was developed to run in a PC, so the machine configurations can 
be compared with those obtained with the PC emulation. This test proved 
that both the updating algorithm and the random number generator worked 
as intended. 



5 Performance 

Table 3 compares the update speed achieved by SUE with that of some sim- 
ulations run by our group in different computers, and with the performance 
obtained running highly optimized multi-spin code in a Cray T3E supercom- 
puter as reported in [17]. We can see that the whole machine matches the 
computational power of one hundred processors of a CrayT3E. 



System 


Update speed (ns/spin) 


Pentium Pro 200 MHz 


170 


Pentium II 500 MHz 


102 


Alpha 133 MHz 


215 


Alpha 400 MHz 


58 


Alpha 500 MHz 


44 


APE (tower) 


6 


Alpha EV5, 600 Mhz [17] 


22 


SUE (single board) 


2.6 


SUE (twelve boards) 


0.22 



Table 3 

SUE performance 



6 Preliminary Physical Results 

In this section we present some preliminary results obtained with SUE. We 
have run an L = 20 lattice in 4 boards at 48 Mhz. We have simulated 1600 
sets of { Jij}, with 2 replicas at 12 different values of (3, which previously have 
been controlled to have a correct transfer probability between them with the 
parallel tempering method. We measure every 16384 sweeps, collecting 800 
measurements. These results have been obtained in 60 days. 

In Fig. 5 we plot the value of the average squared overlap. 
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T 

Fig. 5. Overlap in L = 20, as a function of T. The points correspond to the 12 
simulated T, and the lines are obtained from the spectral density method. 

The errors are plotted only at the simulations points. We are working around 
the critical region, as we reach high values for q 2 . The extrapolated lines con- 
nect properly and the different values evolve smoothly for different T values, 
as corresponds to a good thermalization and a high transition probability from 
parallel tempering. 

At the moment of writing, we are running L = 20 in 12 boards and almost 
finished the runs. Afterwards we will start the simulation in the L = 30 system, 
and estimate that the time needed to obtain good results is around one year. 
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